Relevance of Cluster size in MMR based Summarizer : A Report
Authors
Abstract
Abstraction of documents by humans is complex to model, as is any other information processing by humans. Abstracts differ from person to person, and usually vary in style, language and detail. The process of abstraction is too complex to be formulated mathematically or logically [14]. In the last decade, some systems have been developed that generate abstractions using the latest natural language processing tools. These systems extract phrases and lexical chains from the documents and fuse them together with generative tools to produce a summary (or abstraction). A comparatively less complex approach is to make an extractive summary, in which sentences from the original documents are selected and presented together as a summary.

PROBLEMS WITH EXTRACTIVE METHODS:
• Extracted sentences usually tend to be longer than average. As a result, segments that are not essential to the summary also get included, consuming space.
• Important or relevant information is usually spread across sentences, and extractive summaries cannot capture this (unless the summary is long enough to hold all those sentences).
• Conflicting information may not be presented accurately.

PROBLEMS WITH ABSTRACTIVE METHODS:
• It has been shown that users prefer extractive summaries to glossed-over abstractive summaries [15]. This is because extractive summaries present the information as written by the author, and allow users to read between the lines.
• Sentence synthesis is not yet a well-developed field, and hence machine-generated summaries can be incoherent even within a sentence. In extractive summaries, incoherence occurs only at the border between two sentences.

The work presented in this report is relevant to extractive summaries. In the rest of this section we study some specific methods for producing extractive summaries.
6.1 EXTRACTIVE METHODS
Extractive summarizers aim at picking out the most relevant sentences in the document while also maintaining low redundancy in the summary. While anti-redundancy was not explicitly documented in older systems, most current systems account for it in their own ways.

6.1.1 CLASSICAL METHOD:
Though text summarization has drawn attention primarily after the information explosion on the Internet, the seminal work was done as early as the 1950s. Edmundson presents a survey of the then-existing methods of automatic summarization in [16], and in [7] a systematic approach to summarization that forms the core of extraction methods even today. Edmundson's approach is summarized in Figure 2.

[Figure 2: Step 2 of the summarization process elicited by Edmundson: design sentence scoring metrics such that the extracted sentences are close to the manually generated summaries of Step 1, and iteratively improve the scoring schemes. Box labels: Design scoring schemes for sentences; Extract top-scoring sentences to form summary; Compare summary with manually extracted summary from A.]

The key ideas in this approach are:
1. Study human-generated abstracts, and specify the characteristics expected in automatically generated abstracts.
2. Generate such abstracts manually.
3. Design mathematical and logical formulations to score and pick out sentences from the documents to match these manually generated abstracts. A recent system that automatically learns from training documents and their corresponding abstracts is described in [17].
4. Iteratively improve the sentence-scoring scheme to match the automatic abstracts to the manually generated abstracts.

Computationally representable features of sentences that are useful for scoring sentences for potential inclusion in summaries were proposed in [7], and are used even in today's systems. Stop words are removed.
Sentences are then scored according to four factors:
Cue: Sentences containing cue words or phrases such as "conclusion", "according to the study" and "hardly" are given a higher weight than those not containing them.
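As a rough illustration of the cue factor only, the sketch below scores each sentence by the cue phrases it contains and extracts the top-scoring sentences in document order. It is a minimal sketch under stated assumptions, not Edmundson's actual scheme: the function names, the uniform `CUE_WEIGHT`, the naive sentence splitter and the substring matching are all hypothetical choices made for this example, and stop-word removal and the other scoring factors are left out.

```python
import re

# Cue phrases taken from the examples in the text; a real list would be larger.
CUE_PHRASES = ("conclusion", "according to the study", "hardly")
CUE_WEIGHT = 1.0  # assumed uniform bonus per cue phrase (hypothetical value)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cue_score(sentence):
    """Add CUE_WEIGHT for every cue phrase found as a substring (case-insensitive)."""
    lowered = sentence.lower()
    return sum(CUE_WEIGHT for phrase in CUE_PHRASES if phrase in lowered)


def extract_summary(text, k=1):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # sorted() is stable, so ties are broken in favour of earlier sentences.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cue_score(sentences[i]),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


if __name__ == "__main__":
    doc = "We ran tests. In conclusion, the method works. It was raining."
    print(extract_summary(doc, k=1))
```

In an iterative setup like the one described above, the cue list and weight would be the parts adjusted after each comparison against manually produced extracts.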
Similar resources
LIF at TAC MultiLing: Towards a Truly Language Independent Summarizer
This paper presents the LIF system for the TAC’2011 Multilingual pilot track. We followed a language-independent approach to summarization for this task. In particular, we tried to remove the following dependences to language: sentence segmentation, word segmentation, stop-word lists, and word-level relevance assessment. We applied these modifications to an MMR-based system and observed little ...
IS_SUM: A Multi-Document Summarizer based on Document Index Graphic and Lexical Chains
IS_SUM is a summarizer developed at Institute of Software (IS) of Chinese Academy of Sciences for DUC 2005. We adopt a new way for clustering and summarizing documents by integrating Document Index Graphic (DIG) [7] with Lexical Chains [5]. Our results show the benefit of integrating DIG with Lexical Chains.
Update Summarizer Using MMR Approach
A huge amount of information is present on the WWW, and a lot is being added to it constantly. In this context, query-specific text summarization is one way to address this problem. In this paper we apply MMR to the task of update summary generation.
Centroid-based summarization of multiple documents
We present a multi-document summarizer, MEAD, which generates summaries using cluster centroids produced by a topic detection and tracking system. We describe two new techniques, a centroid-based summarizer, and an evaluation scheme based on sentence utility and subsumption. We have applied this evaluation to both single and multiple document summaries. Finally, we describe two user studies tha...
Accurate user directed summarization from existing tools
This paper describes a set of experimental results produced from the TIPSTER SUMMAC initiative on user directed summaries: document summaries generated in the context of an information need expressed as a query. The summarizer that was evaluated was based on a set of existing statistical techniques that had been applied successfully to the INQUERY retrieval system. The techniques pr...
Information Gain Ratio meets Maximal Marginal Relevance - A method of Summarization for Multiple Documents
In this paper, we propose a method to make a summary from multiple documents that takes comprehensibility and readability into account. As for comprehensibility, we show an integration of MMR into the term-weighting method based on IGR. As for readability, we propose a method to generate a summary based on clustering important sentences according to subtopics and making a keyword list as a very b...